
NVIDIA · Chat / LLM · 120B Parameters (12B Active) · 256K Context (up to 1M)

Function Calling · Streaming · Reasoning · Agent Workflows · Long Context · Code · Tool Use

Overview
NVIDIA Nemotron-3 Super 120B A12B FP8 is an open-weight LLM built for agentic reasoning and high-volume enterprise workloads. Using a hybrid LatentMoE architecture (Mamba-2 + MoE + Attention) with Multi-Token Prediction (MTP) and native NVFP4 pretraining on 25T tokens, it delivers up to 2.2x higher throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B. With a native 1M-token context window, configurable thinking mode, and 60.47% on SWE-Bench Verified, it is purpose-built for collaborative agents, long-context reasoning, and IT automation across 7 languages — served instantly via the Qubrid AI Serverless API.

⚡ 2.2x throughput vs GPT-OSS-120B. 1M token context. 512 experts, 22 active per token. Deploy on Qubrid AI — no H100 cluster required.
Model Specifications
| Field | Details |
|---|---|
| Model ID | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 |
| Provider | NVIDIA |
| Kind | Chat / LLM |
| Architecture | LatentMoE — Mamba-2 + MoE + Attention hybrid with MTP; 512 experts, 22 active per token; 120B total / 12B active |
| Parameters | 120B total (12B active per inference pass) |
| Context Length | 256K Tokens (up to 1M) |
| MoE | Yes (512 experts, 22 active per token) |
| Release Date | March 11, 2026 |
| License | NVIDIA Nemotron Open Model License |
| Training Data | 25T token corpus (NVFP4 native pretraining): web, code, math, science, multilingual; post-training cutoff February 2026; pre-training cutoff June 2025 |
| Function Calling | Supported |
| Image Support | N/A |
| Serverless API | Available |
| Fine-tuning | Coming Soon |
| On-demand | Coming Soon |
| State | 🟢 Ready |
Pricing
💳 Access via the Qubrid AI Serverless API with pay-per-token pricing. No infrastructure management required.
| Token Type | Price per 1M Tokens |
|---|---|
| Input Tokens | $0.10 |
| Input Tokens (Cached) | $0.04 |
| Output Tokens | $0.50 |
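As a worked example, a request with 100K fresh input tokens, 100K cached input tokens, and 2K output tokens costs 0.1 × $0.10 + 0.1 × $0.04 + 0.002 × $0.50 = $0.015.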
Quickstart
Prerequisites
- Create a free account at platform.qubrid.com
- Generate your API key from the API Keys section
- Replace `QUBRID_API_KEY` in the code below with your actual key
💡 Temperature & Top P: Use `temperature=1` and `top_p=0.95` — recommended for all tasks with this model.
Python
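The snippet below is a minimal Python sketch against the OpenAI-compatible API (see Why Qubrid AI?). The `base_url` shown is an assumption; confirm the exact endpoint in the Qubrid docs.

```python
# Minimal streaming chat completion sketch.
from openai import OpenAI

client = OpenAI(
    base_url="https://platform.qubrid.com/v1",  # assumed endpoint; check docs.platform.qubrid.com
    api_key="QUBRID_API_KEY",                   # replace with your key
)

stream = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    messages=[
        {"role": "user", "content": "What are the benefits of renewable energy?"},
    ],
    temperature=1,     # recommended for all tasks with this model
    top_p=0.95,        # recommended nucleus sampling value
    max_tokens=16000,
    stream=True,       # streaming is on by default per the parameter table
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```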
Live Example
Prompt: What are the benefits of renewable energy?
Response:
Playground Features
The Qubrid AI Playground lets you interact with Nemotron-3 Super 120B directly in your browser — no setup, no code, no cost to explore.

🧠 System Prompt
Define the model’s role, reasoning mode, and output constraints before the conversation begins. Particularly powerful for agentic pipelines, tool-use workflows, and structured enterprise tasks.

Set your system prompt once in the Qubrid Playground and it applies across every turn of the conversation.
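The same pattern works over the API by sending a `system` message ahead of the conversation. A minimal sketch, assuming the same endpoint as in the Quickstart and a hypothetical IT-triage role:

```python
from openai import OpenAI

client = OpenAI(base_url="https://platform.qubrid.com/v1", api_key="QUBRID_API_KEY")  # assumed endpoint

# Hypothetical IT-triage system prompt, for illustration only.
response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    messages=[
        {"role": "system", "content": (
            "You are an IT triage agent. Classify each ticket and reply as "
            "JSON with the fields: priority, category, suggested_action."
        )},
        {"role": "user", "content": 'Ticket: "Server keeps crashing every 12 hours." Priority?'},
    ],
    temperature=1,
    top_p=0.95,
)
print(response.choices[0].message.content)
```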
🎯 Few-Shot Examples
Guide the model’s output structure and reasoning depth with concrete examples — no fine-tuning required. Especially effective for structured outputs and multi-step agentic tasks.

| User Input | Assistant Response |
|---|---|
| Ticket: "Server keeps crashing every 12 hours." Priority? | Priority: HIGH. Category: Infrastructure Stability. Suggested action: Check system logs for OOM events, review cron jobs scheduled near crash window, and verify disk I/O health. |
| Summarize this 50-page policy document in 5 bullet points | • Scope: Applies to all employees handling customer PII. • Key requirement: Data must be encrypted at rest and in transit. • Breach protocol: Notify DPO within 72 hours. • Retention: 7-year maximum. • Non-compliance: Subject to disciplinary review. |
💡 Stack multiple few-shot examples in the Qubrid Playground to shape agentic behavior, output schema, and reasoning verbosity — no fine-tuning required.
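Over the API, few-shot examples stack as alternating `user`/`assistant` messages placed before the live input. A minimal sketch using the first example pair from the table above (the final ticket is hypothetical), with the same assumed endpoint:

```python
from openai import OpenAI

client = OpenAI(base_url="https://platform.qubrid.com/v1", api_key="QUBRID_API_KEY")  # assumed endpoint

# One few-shot pair (from the table above), then the live ticket.
messages = [
    {"role": "user", "content": 'Ticket: "Server keeps crashing every 12 hours." Priority?'},
    {"role": "assistant", "content": (
        "Priority: HIGH. Category: Infrastructure Stability. Suggested action: "
        "Check system logs for OOM events, review cron jobs scheduled near crash "
        "window, and verify disk I/O health."
    )},
    # Hypothetical live input, shaped by the example above.
    {"role": "user", "content": 'Ticket: "Printer on floor 3 is out of toner." Priority?'},
]

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    messages=messages,
    temperature=1,
    top_p=0.95,
)
print(response.choices[0].message.content)
```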
Inference Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| Streaming | boolean | true | Enable streaming responses for real-time output |
| Temperature | number | 1 | Controls randomness in output. Recommended: 1.0 for all tasks |
| Max Tokens | number | 16000 | Maximum tokens to generate |
| Top P | number | 0.95 | Controls nucleus sampling. Recommended: 0.95 for all tasks |
Use Cases
- Agentic workflows and multi-agent collaboration
- Long-context reasoning (up to 1M tokens)
- IT ticket automation and high-volume enterprise workloads
- Complex tool use and multi-step function calling (see the sketch after this list)
- RAG (Retrieval-Augmented Generation)
- Software engineering and cybersecurity triaging
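As a rough sketch of the function-calling flow, the example below registers a hypothetical `get_ticket_status` tool via the standard OpenAI-style `tools` parameter; the endpoint is the same assumed one as in the Quickstart:

```python
from openai import OpenAI

client = OpenAI(base_url="https://platform.qubrid.com/v1", api_key="QUBRID_API_KEY")  # assumed endpoint

# Hypothetical tool schema, for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket_status",
        "description": "Look up the status of an IT ticket by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    messages=[{"role": "user", "content": "What is the status of ticket INC-4821?"}],
    tools=tools,
    temperature=1,
    top_p=0.95,
)

# If the model chose to call the tool, run it and send the result back in a
# follow-up "tool" message to complete the loop.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
```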
Strengths & Limitations
| Strengths | Limitations |
|---|---|
| LatentMoE: 512 experts / 22 active per token at same compute cost as standard MoE | Requires minimum 2× H100-80GB GPUs for local deployment |
| 2.2x throughput vs GPT-OSS-120B; 7.5x vs Qwen3.5-122B | Thinking mode adds latency overhead; low-effort mode recommended for simple queries |
| 60.47% SWE-Bench Verified; 83.73% MMLU-Pro; 79.23% GPQA | Not optimized for vision or multimodal inputs |
| Native 1M token context — 91.75% on RULER @ 1M | Function calling supported but may need prompt engineering for complex schemas |
| MTP speculative decoding: 3.45 avg acceptance length (up to 3x wall-clock speedup) | |
| Configurable reasoning mode via `enable_thinking=True/False` | |
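How `enable_thinking` is passed depends on the serving stack. One plausible sketch, assuming a vLLM-style `chat_template_kwargs` passthrough (not confirmed for this endpoint; check the Qubrid docs):

```python
from openai import OpenAI

client = OpenAI(base_url="https://platform.qubrid.com/v1", api_key="QUBRID_API_KEY")  # assumed endpoint

# Assumption: the server forwards chat_template_kwargs to the chat template,
# as vLLM-style OpenAI-compatible servers commonly do.
response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    messages=[{"role": "user", "content": "Summarize this ticket in one line."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # skip reasoning for simple queries
)
print(response.choices[0].message.content)
```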
Why Qubrid AI?
- 🚀 No infrastructure setup — 120B MoE served serverlessly, pay only for what you use
- 🔁 OpenAI-compatible — drop-in replacement using the same SDK, just swap the base URL
- 💰 Cached input pricing — $0.04/1M for cached tokens, critical for long-context and repeated RAG workloads
- ⚡ Throughput-optimized — Nemotron’s 2.2x speed advantage is fully realized on Qubrid’s low-latency infrastructure
- 🧪 Built-in Playground — prototype with system prompts and few-shot examples instantly at platform.qubrid.com
- 📊 Full observability — API logs and usage tracking built into the Qubrid dashboard
Resources
| Resource | Link |
|---|---|
| 📖 Qubrid Docs | docs.platform.qubrid.com |
| 🎮 Playground | Try Nemotron-3 Super 120B live |
| 🔑 API Keys | Get your API Key |
| 🤗 Hugging Face | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 |
| 💬 Discord | Join the Qubrid Community |
Built with ❤️ by Qubrid AI
Frontier models. Serverless infrastructure. Zero friction.